Why do we use charts to tell stories?
Evidence-based visual perception theory
Advice on choosing charts
Advice on using colour in charts
Using this advice to tell stories with charts built with {ggplot2}
A picture is worth a thousand words
There is considerable experimental evidence for data visualisations improving:
Comprehension of data
Decision-making accuracy and confidence
Evidence has been collected using eye tracking, surveys and interviews.
For a good overview of the available research see Eberhard 2021 [1].
Some of these studies consider tables to be a type of data visualisation.
I agree with this! Tables are often awesome choices for presenting data - let’s talk more about this later today.
In 1973 Anscombe [2] published a paper designed to demonstrate…
Graphs are essential to good statistical analysis.
To do so he simulated 4 datasets sharing many identical statistical properties.
However, if you visualised the datasets it was obvious these datasets were fundamentally different to one another.
These charts are now known as Anscombe’s quartet [2].
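Anscombe’s data ships with base R as the `anscombe` data frame, so the claim is easy to check. A minimal sketch (the name `quartet_summary` is ours, not from the original paper) that computes a few summary statistics for each of the four datasets:

```r
# Anscombe's quartet is built into R as `anscombe`
# (columns x1..x4 and y1..y4, one x/y pair per dataset).
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    mean_y = mean(y),
    cor_xy = cor(x, y)
  )
}))

# Rounded to 2 decimal places, the four rows are near-identical
print(round(quartet_summary, 2))
```

Plotting each x/y pair as a scatter plot is what reveals how different the four datasets really are.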
The “Datasaurus Dozen” is a modern reimagining of the original quartet [3].
Datasaurus was originally created by Alberto Cairo [4].
… there’s now an R package for building your own metamers eliocamp.github.io/metamer/
There are several historical visualisations that have fundamentally changed social policy and behaviour.
This is a map from John Snow in 1855 [5] that ties a cholera outbreak to a specific water pump.
Combined with Snow’s statistical analyses this was a significant step towards the development and acceptance of germ theory.
Within a few years, Florence Nightingale [6] was creating charts to demonstrate the importance of basic sanitation in military hospitals.
This specific chart is very dramatic and quite rarely used. It’s a polar area diagram, also known as a Nightingale rose diagram.
But it’s important to acknowledge that Nightingale used many different types of charts in her work.
Her charts and analyses were central to bringing basic sanitation standards to nursing and hospitals.
In 2006 Hans Rosling [7] gave an incredible TED talk where he introduced animated bubble charts as a tool to tell stories about global development.
These charts helped demonstrate the value of interactive and animated data visualisations - which is why Google bought the tool behind the charts!
A more recent example of a very powerful data visualisation is the spiralling global temperature GIF from 2016 by Ed Hawkins [8].
We can create animated GIFs with {ggplot2} via the {gganimate} package. In fact, Pat Schloss [9] has a YouTube video and GitHub repo recreating this chart with R.
There is a wealth of evidence-based research into how precisely and accurately readers perceive charts.
Our evidence comes from:
Eye tracking. We’re really good at measuring where the eye is looking, for how long and how intently.
Asking trial participants to estimate or compare values in charts.
There are open debates on how our internal visual perception system works - what the brain is doing.
A good example is pie charts: we’re still not sure what our brains are doing, but thanks to Robert Kosara [10] we know they’re not measuring area.
Back in 1984 Cleveland & McGill [11] published their seminal paper on graphical perception theory where they defined “elementary perceptual tasks”.
This study is the backbone of much of the research in this field.
Cleveland & McGill [11] designed many experiments where participants were asked to:
Identify the largest/smallest segment
Estimate what % the smaller segment was of the larger segment
The accuracy of subject estimates was then statistically analysed.
Images from Beecham et al [13]
Images from Robert Kosara [14]
Image found on Twitter from @irg_bio [15] - code for chart available from GitHub [16].
To extract accurate values: the magnitude of chart elements.
To quantitatively compare values: the part to whole or relative magnitude of chart elements.
To find the largest/smallest value: the ranking of chart elements.
To find unusual values: the distribution, ranking or magnitude of chart elements.
You have a story you want to tell
There’s lots we can do to help guide the reader to understand your chart and follow the story you’re telling. We’ll cover some examples during this course.
The reader wants to see the data
Charts (and tables) are the best way to see the “big picture” of a dataset - a single summary value (e.g. the mean) tells the reader very little. Interactivity is really useful to allow readers to properly explore the dataset.
The reader has a preconception about the data
Readers might be approaching a chart biased with a particular theory about the data. We can do our best to make our charts easy to read and avoid common pitfalls.
This site also provides simple to follow instructions for using {ggplot2} to build every single chart type you can find on the website.
The Visual Vocabulary is a really useful tool for thinking about how to tell your story with a chart.
Lots of the dataviz at the FT is done with R. John Burn-Murdoch [17] is a great source to follow.
{ggplot2} is an incredibly powerful and flexible tool for building static dataviz.
We can build (almost) any static chart we can conceive of.
The main exception is dual y-axis charts: the two axes must be transformations of one another (for good reasons).
Aesthetics
Geoms
Scales
Guides
Theme
| Where is aes() placed? | What it does |
|---|---|
| Inside ggplot() or on its own | Sets the aesthetics for the entire {ggplot2} object. These could be considered the coordinate system aes() |
| Inside geom_*() | Sets aesthetics for a specific geom within the existing coordinate system aes() for the {ggplot2} object. These should be considered geom specific aes() |
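The difference between the two placements is easiest to see side by side. This is a minimal sketch using the `mpg` dataset bundled with {ggplot2}; the object names `p_global` and `p_layer` are ours.

```r
library(ggplot2)

# aes() inside ggplot(): x and y are inherited by every layer
p_global <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# aes() inside a geom: colour applies to the points only,
# so geom_smooth() still fits a single line through all of the data
p_layer <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Moving `colour = drv` into the `ggplot()` call instead would give one smoothed line per drive type, because every layer would then inherit the grouping.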
Geoms use the aesthetics to add layers to our charts.
There are 50+ geoms baked into the {ggplot2} package.
geom_abline(), geom_area(), geom_bar(), geom_bin2d(), geom_blank(), geom_boxplot(), geom_col(), geom_contour(), geom_contour_filled(), geom_count(), geom_crossbar(), geom_curve(), geom_density(), geom_density_2d(), geom_density_2d_filled(), geom_density2d(), geom_density2d_filled(), geom_dotplot(), geom_errorbar(), geom_errorbarh(), geom_freqpoly(), geom_function(), geom_hex(), geom_histogram(), geom_hline(), geom_jitter(), geom_label(), geom_line(), geom_linerange(), geom_map(), geom_path(), geom_point(), geom_pointrange(), geom_polygon(), geom_qq(), geom_qq_line(), geom_quantile(), geom_raster(), geom_rect(), geom_ribbon(), geom_rug(), geom_segment(), geom_sf(), geom_sf_label(), geom_sf_text(), geom_smooth(), geom_spoke(), geom_step(), geom_text(), geom_tile(), geom_violin(), geom_vline()
As we’ll see later, there are many {ggplot2} extension packages that add even more geoms to the mix.
But geom_bar() itself is built from geom_rect().
There are 8 primitives from which all other geoms are built:
geom_blank(), geom_path(), geom_point(), geom_polygon(), geom_rect(), geom_ribbon(), geom_segment(), geom_text()
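One way to see the link between geom_bar() and the geom_rect() primitive is to inspect the data {ggplot2} computes for a bar layer. The sketch below uses the bundled `mpg` dataset; `bar_data` and `bar_rects` are names we made up for illustration.

```r
library(ggplot2)

# A bar chart of the number of cars per drive type
p_bar <- ggplot(mpg, aes(x = drv)) +
  geom_bar()

# The computed layer data describes rectangles:
# every bar has xmin, xmax, ymin and ymax values
bar_data <- layer_data(p_bar)

# Copy the rectangle corners into a plain data frame
bar_rects <- data.frame(
  xmin = as.numeric(bar_data$xmin),
  xmax = as.numeric(bar_data$xmax),
  ymin = as.numeric(bar_data$ymin),
  ymax = as.numeric(bar_data$ymax)
)

# Drawing the same bars by hand with the primitive
p_rect <- ggplot(bar_rects) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax))
```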
x and y aesthetics
These tell the geom where it needs to be drawn:
x and y
Let’s use geom_segment() to visualise some of the eras of the dinosaurs:
To build this chart we need to specify all of the following: x, xend, y and yend.
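A minimal sketch of such a chart is below. The data frame is our own stand-in, not the dataset used in the slides: the column names are hypothetical and the dates are rounded, approximate values in millions of years ago.

```r
library(ggplot2)

# Stand-in data: approximate start and end of each span,
# in millions of years ago (rounded)
dino_eras <- data.frame(
  era       = c("Triassic", "Jurassic", "Cretaceous"),
  start_mya = c(252, 201, 145),
  end_mya   = c(201, 145, 66)
)

# geom_segment() needs all four positions: where each segment
# starts (x, y) and where it ends (xend, yend)
p_eras <- ggplot(dino_eras) +
  geom_segment(aes(x = start_mya, xend = end_mya, y = era, yend = era)) +
  scale_x_reverse(name = "Millions of years ago")
```

Because each segment is horizontal, `y` and `yend` are mapped to the same column.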
size to affect geom size
In many charts we want geoms to be thicker, bigger or just more prominent.
Timelines (or Gantt charts) are good examples of this. We want the segments to be thicker to improve the readability of the chart - this comes down to the size aesthetic (called linewidth for line-based geoms from {ggplot2} 3.4.0 onwards).
This is still a bad chart.
The eras are not ordered in geological time, instead they’re ordered (reverse) alphabetically.
To control the order of things in {ggplot2} charts we must use factors - which are picked up by the scales.
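As a sketch of the factor approach, the example below fixes the order with base R’s factor(). The data frame is a hypothetical stand-in with rounded dates, and putting the oldest span at the bottom is our choice rather than something the slides specify.

```r
library(ggplot2)

# Stand-in data with approximate dates in millions of years ago
dino_eras <- data.frame(
  era       = c("Triassic", "Jurassic", "Cretaceous"),
  start_mya = c(252, 201, 145),
  end_mya   = c(201, 145, 66)
)

# A character column is sorted alphabetically by the scale.
# Converting to a factor fixes the order: the first level
# is drawn at the bottom of the y axis.
dino_eras$era <- factor(
  dino_eras$era,
  levels = c("Triassic", "Jurassic", "Cretaceous")
)

p_ordered <- ggplot(dino_eras) +
  geom_segment(
    aes(x = start_mya, xend = end_mya, y = era, yend = era),
    linewidth = 10 # use `size` in {ggplot2} versions before 3.4.0
  ) +
  scale_x_reverse(name = "Millions of years ago")
```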
stat functions
The geom_bar() function has a stat argument with the default value of "count".
We can force the geom to behave like geom_col() by changing the stat:
All of the goodness from the stat argument comes from the stat_identity() and stat_count() functions.
If you’re building a complex chart it might be useful to directly call a stat_*() function.
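The sketch below compares the default stat with stat = "identity", geom_col() and a direct stat_count() call. It uses the bundled `mpg` dataset; `drv_counts` and the plot names are ours.

```r
library(ggplot2)

# Pre-computed counts: one row per drive type (columns `drv` and `Freq`)
drv_counts <- as.data.frame(table(drv = mpg$drv))

# Default: stat = "count" tallies the rows of the raw data
p_count <- ggplot(mpg, aes(x = drv)) +
  geom_bar()

# stat = "identity" uses the y values as they are, just like geom_col()
p_identity <- ggplot(drv_counts, aes(x = drv, y = Freq)) +
  geom_bar(stat = "identity")
p_col <- ggplot(drv_counts, aes(x = drv, y = Freq)) +
  geom_col()

# Calling the stat directly and asking it to draw bars
p_stat <- ggplot(mpg, aes(x = drv)) +
  stat_count(geom = "bar")
```

All four objects should draw bars of the same height; they differ only in where the counting happens.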
Box and whisker diagrams hide a lot of detail
Let’s add the data points to this chart with geom_point() and look at the position argument.
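A minimal sketch of this idea, using the bundled `mpg` dataset rather than the data from the slides. The jitter width, seed and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  # Hide the boxplot's own outlier markers so points are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only, so the y values stay exact
  geom_point(
    position = position_jitter(width = 0.2, height = 0, seed = 1),
    alpha = 0.5
  )
```

Setting `height = 0` matters here: vertical jitter would move the points away from their true values.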
The position argument can also be used to create three different types of bar chart:
“stack” creates a stacked bar chart
“fill” creates a proportional bar chart
“dodge” creates a grouped bar chart
Let’s create all 3 of these for the following dataset:
```
# A tibble: 78 × 3
   relig      marital           n
   <fct>      <fct>         <int>
 1 No answer  No answer         4
 2 No answer  Never married    22
 3 No answer  Separated         3
 4 No answer  Divorced         13
 5 No answer  Widowed           7
 6 No answer  Married          44
 7 Don't know Never married     6
 8 Don't know Separated         3
 9 Don't know Divorced          1
10 Don't know Married           5
# … with 68 more rows
```
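The three bar chart types can be sketched as follows. To keep the example self-contained it only uses the ten rows printed above, typed in by hand as a plain data frame; the object names are ours.

```r
library(ggplot2)

# The ten printed rows of the dataset, entered by hand
relig_marital <- data.frame(
  relig = c(rep("No answer", 6), rep("Don't know", 4)),
  marital = c(
    "No answer", "Never married", "Separated", "Divorced", "Widowed", "Married",
    "Never married", "Separated", "Divorced", "Married"
  ),
  n = c(4, 22, 3, 13, 7, 44, 6, 3, 1, 5)
)

base_plot <- ggplot(relig_marital, aes(x = relig, y = n, fill = marital))

# Stacked: segments pile up to the group total
p_stack <- base_plot + geom_col(position = "stack")

# Proportional: every bar is rescaled to a total of 1
p_fill <- base_plot + geom_col(position = "fill")

# Grouped: segments sit side by side
p_dodge <- base_plot + geom_col(position = "dodge")
```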
The geom_smooth() line is hiding data points.
We could either swap the order of these geoms or change the alpha aesthetic.
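Both options can be sketched with the bundled `mpg` dataset (not the data from the slides). Layers are drawn in the order they are added, so the last geom ends up on top.

```r
library(ggplot2)

# Option 1: add the smoother first so the points are drawn on top of it
p_points_on_top <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x) +
  geom_point()

# Option 2: keep the order but make the confidence band more transparent.
# 0.2 is an arbitrary value chosen for illustration.
p_alpha <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, alpha = 0.2)
```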
Scales determine the appearance of an aesthetic within the chart, including:
Which colours are used for each value
Which order the values appear in the chart and guides
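A minimal sketch of a scale controlling both of these, using the bundled `mpg` dataset. The hex colours, labels and legend order are arbitrary choices for illustration.

```r
library(ggplot2)

p_scaled <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(
    name = "Drive",
    # Which colour is used for each value
    values = c("4" = "#1b9e77", "f" = "#d95f02", "r" = "#7570b3"),
    # Which order the values appear in the guide (legend)
    breaks = c("r", "f", "4"),
    labels = c("Rear", "Front", "Four-wheel")
  )
```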
{ggplot2} uses tidy evaluation to allow us to use bare column names in our code.
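Bare column names work directly inside aes(), but need a little help when the column is passed through a function or stored as a string. The sketch below shows the usual patterns; `scatter()` is a hypothetical helper written for this example.

```r
library(ggplot2)

# Bare column names: displ and hwy are looked up inside mpg
p_bare <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Inside a function, embrace the arguments with {{ }} so the
# caller can still pass bare column names
scatter <- function(data, x_col, y_col) {
  ggplot(data, aes(x = {{ x_col }}, y = {{ y_col }})) +
    geom_point()
}
p_function <- scatter(mpg, displ, hwy)

# When the column name is a string, use the .data pronoun
x_name <- "displ"
p_string <- ggplot(mpg, aes(x = .data[[x_name]], y = .data[["hwy"]])) +
  geom_point()
```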
Some rules to follow:
Don’t use pies for more than a few groups (ideally 2)
If using bubble charts vary by area instead of radius
Colour schemes matter A LOT
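For the bubble chart rule, {ggplot2} offers scale_size_area(), which makes the area of each point proportional to the value. A minimal sketch with made-up values (1, 4 and 9) chosen so the effect is easy to check:

```r
library(ggplot2)

# Made-up values for illustration only
bubbles <- data.frame(x = 1:3, y = 1, value = c(1, 4, 9))

p_area <- ggplot(bubbles, aes(x = x, y = y, size = value)) +
  geom_point() +
  # Area is proportional to value, and a value of 0 maps to size 0
  scale_size_area(max_size = 20)

# The drawn point sizes grow with the square root of the value,
# which is what keeps the areas proportional
sizes <- layer_data(p_area)$size
```

scale_radius() does the opposite and maps the value to the radius, which exaggerates differences between large and small values.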
Dataviz are demonstrably awesome
Visual Perception Theory
Let’s make some actual ggplot2 charts / Grammar of Graphics
Deciding on charts (FT Visual Vocabulary)
Things to avoid / Advice (does this belong in visual perception theory?)
Starting at zero
Lots of different types of charts
Dynamite charts
Sucking quote
“There is no way of going from knowing nothing about a subject to knowing something about a subject without going through a period of much frustration and suckiness.” “Push through. You’ll suck less.” - Hadley Wickham, author of ggplot2
Tables
Factors.